Events

SplitSQL: Practical Pushdown Cache for DataLake Analytics (Xiangpeng Hao)

Date

Tue Jan 21, 2025

Time

12:00pm EST

Location

GHC 8115

Speaker

Xiangpeng Hao

Modern data analytics systems embrace a disaggregated architecture that decouples storage, cache, and compute into independent, network-connected components. With a disaggregated cache, a key design decision is whether to push query predicates down to the cache server. Without predicate pushdown, the cache must send all data to the compute nodes, creating a network bottleneck. With predicate pushdown, the cache server evaluates predicates on cached data, but its limited computational resources become the bottleneck.

In this talk, we introduce SplitSQL, a pushdown cache system with efficient predicate evaluation. Our system is built upon a surprising observation: pushdown cost is dominated by decoding data, not by evaluating predicates. SplitSQL reduces decoding overhead by transcoding storage formats (like Parquet) into a cache-optimized format that enables predicate evaluation on encoded data and supports efficient, fine-grained decoding.
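
As a rough illustration of what "predicate evaluation on encoded data" can mean, the sketch below uses plain dictionary encoding. This is an assumption chosen for illustration, not SplitSQL's actual cache format, and the names `DictEncodedColumn`, `filter_encoded`, and `decode_rows` are hypothetical. The predicate runs once per distinct value rather than once per row, and only the matching rows are ever decoded.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class DictEncodedColumn:
    """Hypothetical dictionary-encoded column: distinct values plus one code per row."""
    dictionary: List[str]
    codes: List[int]


def encode(values: List[str]) -> DictEncodedColumn:
    """Assign each distinct value a small integer code in first-seen order."""
    index = {}
    codes = []
    for v in values:
        if v not in index:
            index[v] = len(index)
        codes.append(index[v])
    return DictEncodedColumn(dictionary=list(index), codes=codes)


def filter_encoded(col: DictEncodedColumn, predicate: Callable[[str], bool]) -> List[int]:
    """Return matching row indices without decoding any row."""
    # Evaluate the predicate once per distinct dictionary entry.
    matching = {code for code, value in enumerate(col.dictionary) if predicate(value)}
    # Scan only the integer codes; no string is materialized per row.
    return [row for row, code in enumerate(col.codes) if code in matching]


def decode_rows(col: DictEncodedColumn, rows: List[int]) -> List[str]:
    """Fine-grained decoding: materialize only the selected rows."""
    return [col.dictionary[col.codes[r]] for r in rows]


urls = ["a.com", "b.org", "a.com", "c.net", "b.org", "a.com"]
col = encode(urls)
selected = filter_encoded(col, lambda v: v.endswith(".org"))
print(selected)                    # [1, 4]
print(decode_rows(col, selected))  # ['b.org', 'b.org']
```

Here the string predicate is evaluated three times (once per distinct value) instead of six, and only two rows are decoded, which is the kind of saving the abstract attributes to avoiding full decoding.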

Implemented on Apache DataFusion, SplitSQL achieves both low network traffic and significantly reduced computational overhead compared to conventional pushdown systems.
Experiments on ClickBench show that SplitSQL’s cache-specific format delivers up to a 3x end-to-end performance improvement while maintaining a compression ratio on par with the original storage format.

Bio:
Xiangpeng Hao is a PhD student at the University of Wisconsin-Madison studying computer science with a focus on database/storage systems.